DATS 6101 Team 1 - Project 1

Introduction

Our research aims to perform exploratory data analysis on the Nearest Earth Objects (NEO) dataset recorded by NASA. This dataset includes tens of thousands of labeled observations, collected over the last 22 years, across multiple attributes that each describe a specific object. Borrowing from the humor of Adam McKay’s recent film Don’t Look Up and NASA’s success in the Double Asteroid Redirection Test (DART) mission, we thought that further describing and exploring this dataset would be opportune.

The EDA performed can help with the early detection of a hazardous asteroid. With early detection, NASA can act quickly to mitigate the threat. NASA’s DART mission (Double Asteroid Redirection Test) was carried out for this very purpose: it was designed to deflect an asteroid by striking it head-on, changing its momentum and slowing it down. More details are available in NASA’s coverage of the mission.

According to NASA’s official documentation, a PHA is defined by thresholds on parameters that measure the asteroid’s orbit and size. Specifically, all asteroids with an Earth Minimum Orbit Intersection Distance (MOID) of 0.05 au or less and an absolute magnitude (H) of 22.0 or less are considered Potentially Hazardous Asteroids (PHAs). In other words, asteroids that can’t come any closer to the Earth (i.e., MOID) than 0.05 au (roughly 7,480,000 km), or that are smaller than about 140 m in diameter (i.e., H greater than 22.0), are not considered PHAs.
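These two thresholds can be written as a simple predicate. A minimal sketch in R, with hypothetical inputs (our dataset does not include a MOID column):

```r
# NASA's PHA rule as a predicate (sketch; moid_au and H are hypothetical
# inputs, since the Kaggle dataset does not include MOID)
is_pha <- function(moid_au, H) {
  moid_au <= 0.05 & H <= 22.0
}

is_pha(0.03, 21.5)  # close enough and large enough -> TRUE
is_pha(0.03, 24.0)  # close, but smaller than ~140 m -> FALSE
is_pha(0.10, 20.0)  # large, but never nearer than 0.05 au -> FALSE
```

Both conditions must hold at once; failing either one is enough to rule an object out.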

The current dataset contains 90,836 rows. Each row represents a NEO observation identified by an ID and name. Of these rows, 8,840 are labeled hazardous and 81,996 non-hazardous. The dataset contains no null values. Apart from the hazardous target variable, our dataset contains 5 numerical variables that can help identify whether an asteroid is hazardous. These 5 variables are:

  1. est_diameter_min (km) – estimated minimum diameter of the NEO
  2. est_diameter_max (km) – estimated maximum diameter of the NEO
  3. relative_velocity – the velocity (in km/h) at which the NEO was travelling relative to Earth
  4. miss_distance – the distance (in km) by which the asteroid missed the Earth at its closest approach
  5. absolute_magnitude (H) – the intrinsic brightness of the asteroid

Out of these 5 variables, 3 play a primary role in determining whether an asteroid is harmful. Those are:

  1. relative_velocity
  2. miss_distance
  3. absolute_magnitude

Throughout the rest of this RMD file, we go over each of these 3 variables in depth to answer our SMART question: what statistical characteristics of Nearest Earth Objects observed over the last 22 years, in terms of relative velocity, miss distance and absolute magnitude, make them hazardous or not?

Let’s explore the dataset

Displaying the structure of the data:

## 'data.frame':    90836 obs. of  10 variables:
##  $ id                : int  2162635 2277475 2512244 3596030 3667127 54138696 54189957 54230078 2088213 3766065 ...
##  $ name              : chr  "162635 (2000 SS164)" "277475 (2005 WK4)" "512244 (2015 YE18)" "(2012 BV13)" ...
##  $ est_diameter_min  : num  1.1983 0.2658 0.722 0.0965 0.255 ...
##  $ est_diameter_max  : num  2.679 0.594 1.615 0.216 0.57 ...
##  $ relative_velocity : num  13569 73589 114259 24764 42738 ...
##  $ miss_distance     : num  54839744 61438127 49798725 25434973 46275567 ...
##  $ orbiting_body     : chr  "Earth" "Earth" "Earth" "Earth" ...
##  $ sentry_object     : chr  "False" "False" "False" "False" ...
##  $ absolute_magnitude: num  16.7 20 17.8 22.2 20.1 ...
##  $ hazardous         : chr  "False" "True" "False" "False" ...

Displaying the first 6 rows of the dataset:

##         id                name est_diameter_min est_diameter_max
## 1  2162635 162635 (2000 SS164)           1.1983           2.6794
## 2  2277475   277475 (2005 WK4)           0.2658           0.5943
## 3  2512244  512244 (2015 YE18)           0.7220           1.6145
## 4  3596030         (2012 BV13)           0.0965           0.2158
## 5  3667127         (2014 GE35)           0.2550           0.5702
## 6 54138696         (2021 GY23)           0.0364           0.0813
##   relative_velocity miss_distance orbiting_body sentry_object
## 1             13569      54839744         Earth         False
## 2             73589      61438127         Earth         False
## 3            114259      49798725         Earth         False
## 4             24764      25434973         Earth         False
## 5             42738      46275567         Earth         False
## 6             34298      40585691         Earth         False
##   absolute_magnitude hazardous
## 1               16.7     False
## 2               20.0      True
## 3               17.8     False
## 4               22.2     False
## 5               20.1      True
## 6               24.3     False

The number of NA values in the dataset => 0

Now, we split the dataset into 2 dataframes - one for hazardous and the other for non-hazardous asteroids. This simplifies the tests performed later.

The dimensions of hazardous asteroids dataframe: 8840, 10

The dimensions of non_hazardous asteroids dataframe: 81996, 10

Performing measures of central tendency using Summary

Summary of all asteroids is as follows:

##        id               name           est_diameter_min est_diameter_max
##  Min.   : 2000433   Length:90836       Min.   : 0.0     Min.   : 0.0    
##  1st Qu.: 3448110   Class :character   1st Qu.: 0.0     1st Qu.: 0.0    
##  Median : 3748362   Mode  :character   Median : 0.0     Median : 0.1    
##  Mean   :14382878                      Mean   : 0.1     Mean   : 0.3    
##  3rd Qu.: 3884023                      3rd Qu.: 0.1     3rd Qu.: 0.3    
##  Max.   :54275914                      Max.   :37.9     Max.   :84.7    
##  relative_velocity miss_distance      orbiting_body      sentry_object     
##  Min.   :   203    Min.   :    6746   Length:90836       Length:90836      
##  1st Qu.: 28619    1st Qu.:17210820   Class :character   Class :character  
##  Median : 44190    Median :37846579   Mode  :character   Mode  :character  
##  Mean   : 48067    Mean   :37066546                                        
##  3rd Qu.: 62924    3rd Qu.:56548996                                        
##  Max.   :236990    Max.   :74798651                                        
##  absolute_magnitude  hazardous        
##  Min.   : 9.2       Length:90836      
##  1st Qu.:21.3       Class :character  
##  Median :23.7       Mode  :character  
##  Mean   :23.5                         
##  3rd Qu.:25.7                         
##  Max.   :33.2

Distinct Values

Now, there are multiple rows with the same “id” and “name” in the dataset. This could happen because multiple observations were recorded for the same asteroid, or for some other reason. We remove these duplicates so that the dataset follows the rules of a relational dataset: using the group_by function provided in R, we take the mean across the rows sharing an id and name, so no information is lost.
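The deduplication described above can be sketched with dplyr’s group_by and summarise; the toy dataframe below stands in for the real data:

```r
library(dplyr)

# Toy frame mimicking repeated observations of the same asteroid
neo_toy <- data.frame(
  id   = c(2162635, 2162635, 2277475),
  name = c("162635 (2000 SS164)", "162635 (2000 SS164)", "277475 (2005 WK4)"),
  relative_velocity = c(40000, 60000, 30000)
)

# One row per (id, name), averaging the repeated measurements
deduped <- neo_toy %>%
  group_by(id, name) %>%
  summarise(relative_velocity = mean(relative_velocity), .groups = "drop")

nrow(deduped)                 # 2 distinct asteroids
deduped$relative_velocity[1]  # 50000, the mean of the two observations
```

Averaging (rather than dropping) the duplicates keeps every recorded measurement contributing to the final value.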

The dimension of the hazardous dataframe after using only distinct asteroid ID’s => 2173, 5

The dimension of the non_hazardous dataframe after using only distinct asteroid ID’s => 25250, 5

Relative Velocity

The relative velocity of a NEO is its velocity with respect to the Earth. In this dataset it is recorded in kilometers per hour.


Table: Statistics summary (hazardous).

          relative_velocity
  Min.               5908
  1st Qu.           43018
  Median            58658
  Mean              62794
  3rd Qu.           78786
  Max.             193387

Table: Statistics summary (safe).

          relative_velocity
  Min.                203
  1st Qu.           27636
  Median            42566
  Mean              46479
  3rd Qu.           61062
  Max.             236990
## 
##  Welch Two Sample t-test
## 
## data:  hazardous$relative_velocity and safe$relative_velocity
## t = 54, df = 10456, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15724 16906
## sample estimates:
## mean of x mean of y 
##     62794     46479
## [1] 46311 46647
## attr(,"conf.level")
## [1] 0.95
## [1] 46258 46700
## attr(,"conf.level")
## [1] 0.99
## [1] 62228 63361
## attr(,"conf.level")
## [1] 0.95
## [1] 62050 63539
## attr(,"conf.level")
## [1] 0.99
Model: relative_velocity ~ hazardous

                 Df   Sum Sq  Mean Sq F value Pr(>F)
  hazardous       1 2.12e+12 2.12e+12    3446      0
  Residuals   90834 5.60e+13 6.16e+08      NA     NA
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = relative_velocity ~ hazardous, data = neo)
## 
## $hazardous
##             diff   lwr   upr p adj
## True-False 16315 15770 16860     0

For basic descriptive statistics, the measures of central tendency (mean and median) are both higher in the “hazardous” group than among the objects deemed “safe.”

A Welch two-sample t-test shows that the difference in means between the two groups is significant at the 95% confidence level, as the p-value is smaller than .05.

After constructing confidence intervals at 95% and 99%, we see that the upper bound of the confidence interval for the mean relative velocity of the “safe” group is lower than the lower bound of the interval for the “hazardous” group, giving more support to the association between high relative velocity and whether a NEO is hazardous.
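The test above was presumably produced with a call of the form t.test(hazardous$relative_velocity, safe$relative_velocity); here is a self-contained sketch with simulated data standing in for the real columns:

```r
# Welch two-sample t-test (R's default) on simulated velocities standing in
# for the hazardous and safe groups; means borrowed from the summaries above
set.seed(1)
haz_v  <- rnorm(200, mean = 62794, sd = 25000)
safe_v <- rnorm(200, mean = 46479, sd = 25000)

tt <- t.test(haz_v, safe_v)   # var.equal = FALSE, i.e. Welch's test
tt$p.value < 0.05             # the difference in means is significant

# Confidence intervals for a single group's mean at 95% and 99%
t.test(haz_v, conf.level = 0.95)$conf.int
t.test(haz_v, conf.level = 0.99)$conf.int
```

Note that the 99% interval is necessarily wider than the 95% one, as seen in the output above.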

## Outliers identified: 153 
## Proportion (%) of outliers: 1.8 
## Mean of the outliers: 145497 
## Mean without removing outliers: 62794 
## Mean if we remove outliers: 61338 
## Outliers successfully removed

## Outliers identified: 1367 
## Proportion (%) of outliers: 1.7 
## Mean of the outliers: 126135 
## Mean without removing outliers: 46479 
## Mean if we remove outliers: 45129 
## Outliers successfully removed

Relative velocity is one of the observable characteristics of the Kaggle dataset that other researchers in the field have examined previously (Araujo 2014). Their research sought to identify how frequently asteroids would cross into potentially critical radii (“close encounters”), and, alongside the impact parameter, relative velocity was one of the components the researchers examined. They also used histograms to showcase the relative velocity component of their exploratory data analysis. At a glance, the histograms clearly show that asteroids of the Atens group suffered encounters with a higher relative velocity than the Amor group. Their overall findings were similar to ours, with asteroids and their satellites having a higher relative velocity representing the greatest probability of surviving the critical radii in a close encounter.

To estimate an asteroid’s destructive damage to Earth and its human-built infrastructure, we have baseline comparisons (grim as they are) in the area of high-yield nuclear weapons. To express asteroid impact energy as the yield of a high-explosive weapon in the megaton range, we would need to know the asteroid’s mass. Basic kinematics states that kinetic energy is a function of mass and velocity squared, KE = ½mv². A one-megaton explosion releases the energy of one million tons of TNT (trinitrotoluene), 4.18×10^15 joules (Krehl 2008); the atomic bomb dropped on Hiroshima was only about 15 kilotons, while asteroid impacts would easily reach the megatons (Tonry 2010). That research likewise requires mass to finalize its energy estimates, and the scientists derive mass from an assumed density (usually taken as ∼2–3 g/cm³). They use this information to estimate destructive capacity and to propose early-warning detection systems based on what is known about asteroid impacts. Our work would benefit equally if we had asteroid mass (or a way to derive it).
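Lacking a mass column, the back-of-envelope calculation can still be sketched by assuming a spherical shape and a density; every input below is a hypothetical assumption, not a value from the dataset:

```r
# Back-of-envelope impact energy (sketch; diameter, velocity and density are
# hypothetical assumptions, since the dataset has no mass column)
impact_energy_megatons <- function(diameter_km, velocity_kms,
                                   density_g_cm3 = 2.5) {
  radius_m  <- diameter_km * 1000 / 2
  volume_m3 <- (4 / 3) * pi * radius_m^3
  mass_kg   <- volume_m3 * density_g_cm3 * 1000        # g/cm^3 -> kg/m^3
  ke_joules <- 0.5 * mass_kg * (velocity_kms * 1000)^2  # KE = 1/2 m v^2
  ke_joules / 4.18e15                                   # 1 Mt TNT = 4.18e15 J
}

# A 140 m object at 20 km/s is already well into the megaton range
impact_energy_megatons(0.14, 20)
```

Even at the PHA size cutoff of roughly 140 m, the estimated yield is well over a hundred megatons, which is why the H = 22.0 threshold matters.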

Miss Distance

1. Descriptive statistics:

The miss distance, in kilometers relative to Earth, was one of the variables of interest; we believe it is one of the key factors that might influence whether an asteroid is deemed hazardous. First, we subset the data into neo_tyler with the variables of interest: name, miss_distance and hazardous. We then checked whether the data confirms the range of distance, relative to Earth, that Potentially Hazardous Asteroids (PHAs) have: “Potentially Hazardous Asteroids: NEAs whose Minimum Orbit Intersection Distance (MOID) with the Earth is 0.05 au or less” (NASA). By converting the units from kilometers to astronomical units (au) and using the summary() function, we are able to confirm that these are indeed PHAs as defined by NASA. For this project, we thought that expressing miss_distance in millions of kilometers relative to Earth would be more comprehensible to the audience.
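The unit conversions mentioned above (kilometers to au, and kilometers to millions of kilometers) amount to simple divisions, with 1 au ≈ 149,597,870.7 km:

```r
# Unit conversions for miss_distance
km_per_au <- 149597870.7

km_to_au       <- function(km) km / km_per_au
km_to_millions <- function(km) km / 1e6

km_to_au(7480000)        # NASA's 0.05 au threshold, roughly
km_to_millions(7480000)  # the same distance: 7.48 million km
```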

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.115   0.253   0.248   0.378   0.500

After properly subsetting the data, we noticed from the Excel file that there are 27,423 unique asteroids. We therefore averaged the multiple miss_distance values that appear for the same asteroid, using the group_by() function, which allowed us to simplify the data to our needs.

Finally, we further subset the grouped data into hazardous (neo_h_tyler) and not hazardous (neo_nh_tyler) so that we can properly answer our S.M.A.R.T. question, starting with the descriptive statistics from the summary() function for each subset. As these figures show, the range is roughly the same for both subsets, owing to the NASA definition of PHAs, while the mean of the hazardous data is slightly higher than that of the non-hazardous data. We also examined the variation within each subset using the sd() function; as it turns out, hazardous asteroids tend to vary less than non-hazardous ones. Given the slight difference in means and variability, this seemed the best moment to visualize the output with a boxplot and histograms.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.8    27.2    38.0    35.8    45.4    74.7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    12.7    32.2    30.0    43.9    74.8
## [1] 14.4
## [1] 18.4

2. Graphical interpretations of the data:

(Notation: h - for hazardous subset and nh- for not hazardous subset)

  • Boxplots: As expected, the boxplots show that the mean for h is slightly higher than for nh. Additionally, the ranges are essentially identical for both groups; their frequencies, however, might differ, so let’s take a look at the relative frequency histograms.

  • Relative Frequency Histograms: The histograms show both similarities and differences between the h and nh subsets. The h data appears to follow a roughly normal distribution, with most of the frequency concentrated around the 40-million-kilometer mark, although frequencies are also elevated in the 0–20-million-kilometer range. The nh data presents the same features, but its values in the 0–20-million-kilometer range are more pronounced. Since the distribution for h closely resembles a normal distribution, this seemed the most appropriate moment to run normality tests.

3. Normality Tests:

  • QQ-plots: As expected from the relative frequency histograms, the QQ-plot for h closely follows the QQ-line (suggesting the data follow a normal distribution), although the values clearly depart in the 0–20-million-kilometer range. The nh data, on the other hand, is clearly under-dispersed, as shown by the S-shape in its QQ-plot, meaning there is excess negative kurtosis in the subset. Having examined the visual component of these subsets, we next revisit the difference in means to check whether h and nh share the same sampling distribution for miss_distance.

4. T-test/Intervals

Two-tailed t-tests:

  • @ 95% confidence level: The lower confidence bound for h (which, recall, has the higher mean of the two) is 35.2 million kilometers, while the upper confidence bound for nh is 30.2 million kilometers. Since there is roughly a 5-million-kilometer gap between the intervals, we thought it best to also check a higher confidence level.
## 
##  One Sample t-test
## 
## data:  neo_h_tyler$miss_distance
## t = 116, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  35.2 36.5
## sample estimates:
## mean of x 
##      35.8
## 
##  One Sample t-test
## 
## data:  neo_nh_tyler$miss_distance
## t = 259, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  29.8 30.2
## sample estimates:
## mean of x 
##        30

  • @ 99% confidence level: The lower confidence bound for h is 35.1 million kilometers, while the upper confidence bound for nh is 30.3 million kilometers. This 4.8-million-kilometer gap indicates that these means might not come from the same sampling distribution, so we explored the difference further with an ANOVA test.

## 
##  One Sample t-test
## 
## data:  neo_h_tyler$miss_distance
## t = 116, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  35.1 36.6
## sample estimates:
## mean of x 
##      35.8
## 
##  One Sample t-test
## 
## data:  neo_nh_tyler$miss_distance
## t = 259, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  29.7 30.3
## sample estimates:
## mean of x 
##        30

5. ANOVA & Post-Hoc Tukey:

ANOVA: The ANOVA test yielded highly significant results. As indicated by the extremely small p-value (<2e-16), these means do not come from the same sampling distribution and are therefore genuinely different. To get a more precise reading on the difference, we ran a post-hoc Tukey test to determine by exactly how much the means differ.

##                           Df  Sum Sq Mean Sq F value Pr(>F)    
## grouped_data$hazardous     1   68168   68168     207 <2e-16 ***
## Residuals              27421 9030616     329                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-Hoc Tukey: The post-hoc Tukey test shows that the means differ by roughly 5.0 to 6.6 million kilometers (point estimate 5.84).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = grouped_data$miss_distance ~ grouped_data$hazardous, data = neo_tyler)
## 
## $`grouped_data$hazardous`
##            diff  lwr  upr p adj
## True-False 5.84 5.04 6.63     0
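The ANOVA and Tukey calls have roughly this shape; the sketch below uses simulated data (group means and standard deviations borrowed from the summaries earlier) in place of the real grouped dataframe:

```r
# One-way ANOVA followed by Tukey's HSD on simulated miss distances
# (means/SDs taken from the summaries above; the data here are synthetic)
set.seed(42)
toy <- data.frame(
  hazardous     = rep(c("True", "False"), each = 300),
  miss_distance = c(rnorm(300, mean = 35.8, sd = 14.4),
                    rnorm(300, mean = 30.0, sd = 18.4))
)

fit <- aov(miss_distance ~ hazardous, data = toy)
summary(fit)   # F-test: do the group means differ?
TukeyHSD(fit)  # pairwise difference with a family-wise 95% CI
```

With only two groups, the Tukey comparison reduces to a single difference, but it still reports the family-wise confidence interval for that difference.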

6. Conclusion:

-Hazardous NEOs, on average, are slightly farther from Earth (by roughly 5.0 to 6.6 million kilometers) than those not hazardous.

-Hazardous NEOs vary less in their distance relative to Earth (σ = 14.4 versus σ = 18.4 million kilometers) than those not hazardous.

-Hazardous NEOs do not share the same sampling distribution of distance relative to Earth as those not hazardous.

Absolute Magnitude

Absolute magnitude is the third variable that we will go over in detail. According to Wikipedia, it is defined as the magnitude (brightness) a celestial object would have if seen from a standard distance of 10 parsecs. Absolute magnitude plays a very important role in determining whether a NEO is hazardous, because an object’s brightness depends on its reflecting area: for a given albedo, a larger asteroid reflects more light and therefore has a lower (brighter) absolute magnitude. Hence the absolute_magnitude variable should bear a definite relationship to the diameter variables (est_diameter_min and est_diameter_max). From the official NASA documentation, it is also worth pointing out that the absolute magnitude of a PHA is always 22.0 or less. Throughout the rest of this section, we go over plots and perform tests to check whether the hazardous asteroids’ absolute magnitudes indeed lie at or below 22.0 H.
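For reference, the standard conversion between absolute magnitude and diameter is D (km) = (1329 / √p) · 10^(−H/5), where p is the geometric albedo. A sketch in R; the albedo default is an assumption, since the dataset does not record it:

```r
# H -> diameter conversion; the albedo value is an assumed default, as the
# dataset does not provide one
diameter_km <- function(H, albedo = 0.14) {
  (1329 / sqrt(albedo)) * 10^(-H / 5)
}

diameter_km(22.0)  # roughly 0.14 km (~140 m), matching NASA's PHA size cutoff
```

This is why the H = 22.0 magnitude threshold and the ~140 m diameter threshold in the PHA definition are two statements of the same cutoff.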

Before that, we shall explore some key characteristics of absolute_magnitude in detail:

Summary of absolute_magnitude of hazardous asteroids:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    19.6    20.6    20.3    21.4    22.4

Summary of absolute_magnitude of non-hazardous NEO:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.2    22.1    24.1    23.9    25.9    33.2

Let’s find the mode

Mode of absolute_magnitude in neo dataset => 24.4

Mode of absolute_magnitude in hazardous dataset => 21.9

Mode of absolute_magnitude in non-hazardous dataset => 24.4
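Base R has no built-in function for the statistical mode, so the values above were obtained with an idiom along these lines (a sketch; the helper name is ours):

```r
# Most frequent value via table(); a common idiom, since base R's mode()
# reports storage type rather than the statistical mode
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}

stat_mode(c(24.4, 21.9, 24.4, 25.0, 24.4))  # -> 24.4
```

If several values tie for the highest count, which.max returns the first, so ties are resolved arbitrarily.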

Measures of Variance and Standard Deviation

Variance of absolute_magnitude of hazardous NEOs => 1.8

Variance of absolute_magnitude of non_hazardous NEOs => 7.847

Standard deviation of absolute_magnitude of hazardous NEO => 1.342

Standard deviation of absolute_magnitude of non_hazardous NEO => 2.801

Graphical understanding using Plots

Note - We see that the absolute_magnitude for hazardous NEOs lies mostly between 19 and 22, while that of the non-hazardous asteroids generally lies between 23 and 27.

This is consistent with what the NASA documentation states: Potentially Hazardous Asteroids have an absolute magnitude of 22.0 H or less.

Outliers

Checking for outliers is a good way to detect “bad data,” and it is essential to keep track of how many records are outliers. To do so, we use the outlierKD2 function:

## Outliers identified: 231 
## Proportion (%) of outliers: 2.7 
## Mean of the outliers: 16.1 
## Mean without removing outliers: 20.3 
## Mean if we remove outliers: 20.4 
## Outliers successfully removed

The dimension of the hazardous dataframe after removing outliers: 8840, 10

## Outliers identified: 446 
## Proportion (%) of outliers: 0.5 
## Mean of the outliers: 18.9 
## Mean without removing outliers: 23.9 
## Mean if we remove outliers: 23.9 
## Outliers successfully removed

The dimension of the non_hazardous dataframe after removing outliers: 81996, 10
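outlierKD2 (a helper provided in class) flags values outside the usual 1.5 × IQR fences; an equivalent sketch of that rule:

```r
# 1.5 * IQR rule, the convention helpers like outlierKD2 typically follow
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- c(rep(c(19, 20, 21), times = 33), 5)  # tight cluster plus one extreme value
sum(iqr_outliers(x))                       # -> 1: only the extreme point is flagged
```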

T-Tests

Now, with all outliers removed and distinct values extracted, we can perform t-tests.

T-Tests on hazardous dataframe:

## 
##  One Sample t-test
## 
## data:  hazardous_grouped$absolute_magnitude
## t = 670, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20.1 20.3
## sample estimates:
## mean of x 
##      20.2

T-Tests on non_hazardous dataframe:

## 
##  One Sample t-test
## 
## data:  non_hazardous_grouped$absolute_magnitude
## t = 1341, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  23.6 23.7
## sample estimates:
## mean of x 
##      23.6

Contingency Table

Now, we shall validate the above by printing the Contingency table. The contingency table of hazardous is below:

## 
## (16,16.5] (16.5,17] (17,17.5] (17.5,18] (18,18.5] (18.5,19] (19,19.5] (19.5,20] 
##        21        23        39        58       107       134       191       240 
## (20,20.5] (20.5,21] (21,21.5] (21.5,22] (22,22.5] (22.5,23] 
##       296       302       352       368        17         0

We see the highest frequency of values falling in the 21–22 interval for hazardous asteroids. However, as noticed in the plots drawn, the distribution is left-skewed, so the mean is shifted somewhat to the left.

Here too, we can confirm that the PHAs in our dataset follow the official NASA documentation.

The contingency table of non_hazardous is below:

## 
## (16,16.5] (16.5,17] (17,17.5] (17.5,18] (18,18.5] (18.5,19] (19,19.5] (19.5,20] 
##        76        92       161       233       401       527       706       795 
## (20,20.5] (20.5,21] (21,21.5] (21.5,22] (22,22.5] (22.5,23] (23,23.5] (23.5,24] 
##       913      1015       933       933      1349      1397      1600      1744 
## (24,24.5] (24.5,25] (25,25.5] (25.5,26] (26,26.5] (26.5,27] (27,27.5] (27.5,28] 
##      1905      1908      1830      1556      1430      1162       854       625 
## (28,28.5] (28.5,29] (29,29.5] (29.5,30] (30,30.5] (30.5,31] (31,31.5] (31.5,32] 
##       404       247       160        98        50        22        11        10

We see a higher frequency of values falling in the interval 24-25 for non_hazardous.

Anova Test:

For the ANOVA test, we drew 2 samples of equal size, one from the hazardous group and one from the non-hazardous group.

##               Df   Sum Sq Mean Sq  F value Pr(>F)    
## hazardous      1 1.60e+08 1.6e+08 80793733 <2e-16 ***
## Residuals   2172 4.29e+03 2.0e+00                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation - the p-value is very small (<0.001), which indicates that we can reject the null hypothesis and conclude that the mean absolute magnitude of hazardous and non-hazardous NEOs is not the same. Note that the p-value is the same (<2e-16) in the ANOVA test and the t-tests.

Here, we confirm that our dataset is consistent with the official definition of PHAs, with hazardous asteroids having absolute magnitudes at or below roughly 22 H.

Correlation with Diameter

A correlation plot shows how strongly each variable relates to the others. We know from earlier that absolute magnitude is inversely related to the size of a NEO. We draw a correlation plot with est_diameter_min and est_diameter_max to verify whether this relationship holds.

As the correlation plot shows, the min and max estimated diameters are closely, and inversely, related to absolute magnitude. This makes sense because absolute magnitude (the brightness of a NEO) reflects the size of the asteroid in question: the bigger the asteroid, the lower its absolute magnitude (the brighter it appears), and thus the more potentially hazardous it is.

Overall Correlation Plots:

After exploring each of these 3 variables, we now examine their correlation with our target variable, hazardous. We draw both Pearson and Spearman correlation plots.

Here, we see that the hazardous variable correlates most strongly with absolute magnitude and the diameter variables. This means that the size and brightness of an asteroid play a very significant role in deciding whether it is hazardous. These variables are followed by relative velocity and miss distance.
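Pearson and Spearman coefficients come from the same cor() call with different method arguments; a sketch with simulated data, where hazardousness is driven by absolute magnitude in the same direction as in our dataset:

```r
# Pearson vs Spearman correlation with a binary target (sketch; data are
# simulated so that low H, i.e. bright/large objects, drive hazardousness)
set.seed(7)
H <- runif(500, min = 15, max = 30)  # simulated absolute magnitudes
hazardous <- as.numeric(H < 22)      # PHA-style rule on H alone

cor(H, hazardous, method = "pearson")   # strongly negative
cor(H, hazardous, method = "spearman")  # rank-based; same sign
```

Spearman works on ranks, so it is less sensitive to the skew we observed in these variables, which is why we report both.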

PROJECT 2

EDA ENDS

Splitting and Sampling

set.seed(123)
non_hazardous = non_hazardous[sample(nrow(non_hazardous),nrow(hazardous)),]
nasa_model_data= rbind(hazardous, non_hazardous)
nrow(nasa_model_data)
## [1] 17680
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nasa_model_data))

## set the seed to make your partition reproducible
set.seed(123)
train_data_size <- sample(seq_len(nrow(nasa_model_data)), size = smp_size)


nasa_train.labels = nasa_model_data[train_data_size,c("hazardous")]
nasa_test.labels = nasa_model_data[-train_data_size,c("hazardous")]

nasa_model_data = nasa_model_data %>% select(est_diameter_min, est_diameter_max, relative_velocity, miss_distance, absolute_magnitude)
nasa_train <- nasa_model_data[train_data_size,]
nasa_test <- nasa_model_data[-train_data_size,]

Normalisation

We normalise (centre and scale) the features because KNN and SVM are distance-based models: without scaling, variables with large numeric ranges, such as miss_distance (tens of millions of kilometres), would dominate the distance computation over variables such as absolute_magnitude.

nasa_train = scale(nasa_train, center = TRUE, scale = TRUE)
# Note: the test set is scaled with its own statistics here; strictly, the
# training set's centers and scales should be reused for the test set.
nasa_test = scale(nasa_test, center = TRUE, scale = TRUE)

KNN

library(class)
loadPkg("gmodels")
loadPkg("FNN")
loadPkg("caret")
knn.5 <- knn(train=nasa_train, test=nasa_test, cl=nasa_train.labels, k=5)
ACC.5=100 * sum(nasa_test.labels == knn.5)/NROW(nasa_test.labels)
paste0("Accuracy = ", ACC.5)
## [1] "Accuracy = 85.0904977375566"
print(confusionMatrix(knn.5, as.factor(nasa_test.labels) ))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1701  178
##          1  481 2060
##                                        
##                Accuracy : 0.851        
##                  95% CI : (0.84, 0.861)
##     No Information Rate : 0.506        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.701        
##                                        
##  Mcnemar's Test P-Value : <2e-16       
##                                        
##             Sensitivity : 0.780        
##             Specificity : 0.920        
##          Pos Pred Value : 0.905        
##          Neg Pred Value : 0.811        
##              Prevalence : 0.494        
##          Detection Rate : 0.385        
##    Detection Prevalence : 0.425        
##       Balanced Accuracy : 0.850        
##                                        
##        'Positive' Class : 0            
## 
ResultDf = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )

for (kval in seq(11,50,2)) {
  nasa_pred <- knn(train = nasa_train, test = nasa_test, cl=nasa_train.labels, k=kval)
  #NASA_PREDCross = CrossTable(nasa_test.labels, nasa_pred, prop.chisq = FALSE)
  cm = confusionMatrix(nasa_pred, as.factor(nasa_test.labels) ) # from caret library
  cmaccu = cm$overall['Accuracy']
  # record the accuracy as a percent, but leave k itself unscaled
  cmt = data.frame(k=kval, Total.Accuracy = cmaccu*100, row.names = NULL ) # initialize a row of the metrics
  ResultDf = rbind(ResultDf, cmt)
}
xkabledply(ResultDf, "Total Accuracy Summary:")
Total Accuracy Summary:

 k  Total.Accuracy
11            86.3
13            86.8
15            86.8
17            86.7
19            86.7
21            87.0
23            87.0
25            87.1
27            87.2
29            87.2
31            87.1
33            87.2
35            87.1
37            87.1
39            87.1
41            87.1
43            87.0
45            86.9
47            86.8
49            86.9
ggplot(data=ResultDf, aes(x=k, y=Total.Accuracy,group=1)) +
  geom_line(color="#aa0022", size=1.75) +
  geom_point(color="#aa0022", size=3.5) +
  ggtitle("K value against their accuracy in KNN model") +
  labs(x="K value", y="Accuracy of K value") +
  theme(axis.title.y = element_text(size=10, family="Trebuchet MS", color="#666666")) +
  theme(axis.text = element_text(size=14, family="Trebuchet MS")) +
  theme(plot.title = element_text(size=20, family="Trebuchet MS", face="bold", hjust=0, color="#666666"))

From the above, we see that the highest accuracy, about 87.2%, occurs for k in the range of roughly 27 to 33. We also note that at k = 5, specificity (0.920) is noticeably higher than sensitivity (0.780); since the positive class here is “False” (non-hazardous), this means the model identifies hazardous objects more reliably than non-hazardous ones.

SVM

# install.packages("e1071")
library("e1071")
classifier = svm(formula = nasa_train.labels ~ .,
                 data = nasa_train,
                 type = 'C-classification',
                 kernel = 'linear',cost = 10)
y_pred = predict(classifier, newdata = nasa_test)

cm = confusionMatrix(y_pred, as.factor(nasa_test.labels) )
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1612   48
##          1  570 2190
##                                       
##                Accuracy : 0.86        
##                  95% CI : (0.85, 0.87)
##     No Information Rate : 0.506       
##     P-Value [Acc > NIR] : <2e-16      
##                                       
##                   Kappa : 0.719       
##                                       
##  Mcnemar's Test P-Value : <2e-16      
##                                       
##             Sensitivity : 0.739       
##             Specificity : 0.979       
##          Pos Pred Value : 0.971       
##          Neg Pred Value : 0.793       
##              Prevalence : 0.494       
##          Detection Rate : 0.365       
##    Detection Prevalence : 0.376       
##       Balanced Accuracy : 0.859       
##                                       
##        'Positive' Class : 0           
## 

Random Forest

nasa_model_data_randomforest= rbind(hazardous, non_hazardous)
nasa_model_data_randomforest$hazardous = as.factor(nasa_model_data_randomforest$hazardous)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nasa_model_data_randomforest))

## set the seed to make your partition reproducible
set.seed(123)
train_data_size <- sample(seq_len(nrow(nasa_model_data_randomforest)), size = smp_size)

nasa_train_randomforest <- nasa_model_data_randomforest[train_data_size,]
nasa_test_randomforest <- nasa_model_data_randomforest[-train_data_size,]


library(randomForest)
library("pROC") 
classifier <- randomForest( formula = hazardous~., data=nasa_train_randomforest)
print(classifier)
## 
## Call:
##  randomForest(formula = hazardous ~ ., data = nasa_train_randomforest) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.69%
## Confusion matrix:
##      0    1 class.error
## 0 5541 1117      0.1678
## 1  168 6434      0.0254
pred_1 = predict(classifier, nasa_test_randomforest)
paste0("Accuracy = ",mean(nasa_test_randomforest$hazardous == pred_1))
## [1] "Accuracy = 0.880769230769231"
# Note: pred_1 holds hard class predictions, not probabilities, so the ROC
# below is built from the predicted 0/1 labels rather than class probabilities
nasa_test_randomforest$prob = pred_1
nasa_test_randomforest$hazardous = as.numeric(nasa_test_randomforest$hazardous)
nasa_test_randomforest$prob = as.numeric(nasa_test_randomforest$prob)
h <- roc(hazardous~prob, data=nasa_test_randomforest)
print(h)
## 
## Call:
## roc.formula(formula = hazardous ~ prob, data = nasa_test_randomforest)
## 
## Data: prob in 2182 controls (hazardous 1) < 2238 cases (hazardous 2).
## Area under the curve: 0.88
plot(h,main="ROC Curve for Random Forest",col=2,lwd=2)

varImpPlot(classifier)

The classifier’s out-of-bag estimate of error is 9.69%, which corresponds to an accuracy of about 90.3%; on the held-out test set, accuracy is 88.1%. The area under the ROC curve is 0.88. Overall, this is a very good fit.

Limitations

The neo.csv dataset posed limitations in its variables, description and data dictionary. Since the dataset was obtained from a web-based environment, Kaggle, we had insufficient variables, a poor description of the dataset and no data dictionary. In terms of the variables provided, it would have been ideal to have one describing the trajectory toward, or away from, Earth, as well as other potentially hazard-related characteristics. Instead, we were restricted to the three quantifiable variables used to characterize the potential of an object being hazardous, and in retrospect, using just these three might have over-simplified our determination of a hazard. The description of the dataset is also very limited given the constraint of using Kaggle; it would have been better to obtain the data straight from a NASA website or API, but this was not possible. It would have been a huge perk to know how NASA professionals use these variables, what causes an object to be recorded multiple times, and how the boolean hazardous label was originally assigned. Finally, the Kaggle page provided no data dictionary, so we had to research on our own what these variables meant and what they were describing. With insufficient variables, a poor description and no data dictionary, it was difficult to infer, or fully understand, the root of a hazardous or non-hazardous asteroid well enough to fully answer our question.

References: